
ci: Change host tests to run in environment mode#5008

Merged
backspace merged 67 commits into main from environment-mode-host-tests-cs-11275 on Jun 17, 2026

@backspace backspace commented May 28, 2026

This lets us catch regressions with CI. Summary of changes:

  • live-test also switches to use environment mode so test-web-assets can still be shared with host shards
  • test-web-assets is configurable for environment mode (ci environment name)
  • tests map the hardcoded https://localhost:4202/test realm URL to the environment's test realm URL
  • hardcoded *.localhost:4201 (published), localhost:4202 (test), and localhost:4206 (icons) assertions are now dynamic
  • realm permissions are cloned for environment mode domains

This will tighten feedback cycles for host tests in parallel environments:

[Image: screenshot, 2026-06-12 at 15:55:46]

backspace and others added 2 commits May 27, 2026 17:41
Dev realms that are public-readable in standard mode (skills, catalog,
experiments) returned 401 to unauthenticated readers in environment
mode. Their public-read grants are seeded by static-URL migrations keyed
on localhost:4201, but environment mode mounts the same realms at
realm-server.<slug>.localhost URLs that no migration row matches. The
base realm is unaffected because its public grant is keyed on the
canonical https://cardstack.com/base/ URL.

Add an environment-mode-only step to the boot-time registry backfill
that mirrors the already-public set (matched by URL path) onto each
bootstrap realm's actual URL. It is idempotent and only grants read on
paths already declared public by policy, so it cannot expose a realm
that is meant to stay private.
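
A minimal sketch of the path-matched mirroring described above, under stated assumptions: `PermissionRow` and `mirrorPublicRead` are hypothetical names, the row shape is simplified, and the URL schemes in the usage note are illustrative. It is not the actual backfill code.

```typescript
// Hypothetical, simplified row shape; the real permissions table may differ.
interface PermissionRow {
  realmURL: string;
  username: string; // '*' stands for any (unauthenticated) reader
  read: boolean;
}

// Returns the public-read rows that are missing for the env-mode realm URLs.
// A realm only qualifies if an existing row already grants '*' read on a URL
// with the same pathname, so nothing private can become public.
function mirrorPublicRead(
  existing: PermissionRow[],
  envRealmURLs: string[],
): PermissionRow[] {
  const publicRows = existing.filter((row) => row.username === '*' && row.read);
  const publicPaths = new Set(
    publicRows.map((row) => new URL(row.realmURL).pathname),
  );
  const alreadyGranted = new Set(publicRows.map((row) => row.realmURL));

  return envRealmURLs
    .filter((url) => publicPaths.has(new URL(url).pathname))
    .filter((url) => !alreadyGranted.has(url)) // keeps reruns idempotent
    .map((url) => ({ realmURL: url, username: '*', read: true }));
}
```

Running it a second time with the returned rows appended to `existing` yields an empty list, which is the idempotence property the commit relies on.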

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boots the realm-server stack under BOXEL_ENVIRONMENT (Traefik, *.localhost
hostnames, per-slug database, dynamic ports) and asserts that the skills
realm — public in standard mode — also returns 200 to an unauthenticated
reader here, the parity the host integration tests depend on.

This is a deliberately small first step: it de-risks the env-mode CI
infrastructure (Traefik in CI, *.localhost resolution, env-mode boot)
before converting the full sharded host suite, and it guards the
public-read parity restored in the previous commit. A fixed "ci" slug is
safe because each job runs on an isolated VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 28, 2026

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   2h 0m 17s ⏱️ + 4m 22s
3 094 tests ±0  3 079 ✅ ±0  15 💤 ±0  0 ❌ ±0 
3 113 runs  ±0  3 098 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit fb2b91b. ± Comparison against earlier commit abb4a94.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   12m 57s ⏱️ +35s
1 731 tests ±0  1 731 ✅ ±0  0 💤 ±0  0 ❌ ±0 
1 824 runs  ±0  1 824 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit fb2b91b. ± Comparison against earlier commit abb4a94.

backspace and others added 13 commits May 27, 2026 19:40
Extends the environment-mode proof-of-concept to run the realm-touching
suite that the ticket reported failing in env mode, against the env-mode
dist. Validates the fix beyond the public-read curl check and is the
first step toward running the full host suite in environment mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bare 'ember' invocation failed with 'No such file or directory'
because node_modules/.bin was not on PATH; route through pnpm exec like
the other host test steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The public-read parity assertion is the job's gate; running the
realm-touching suite in env mode still surfaces harness issues unrelated
to that fix, so keep the slice visible but non-blocking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probes the prerender's failing cross-origin module fetch
(host.ci.localhost -> realm-server.ci.localhost/base/card-api) to
determine whether the realm-server behind Traefik emits an
Access-Control-Allow-Origin header for a cross-subdomain origin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boots the realm-server test stack under BOXEL_ENVIRONMENT (Traefik,
*.localhost hostnames, per-slug paths) and asserts the same public-read
parity gate the host PoC validates. Runs one shard of the realm-server
suite under the env-mode stack as a non-blocking slice while integration
fallout against env-mode services is matured. Reuses the env-mode CI
infrastructure proven by the host PoC (no new infra required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shard 1/6 ran cleanly under env mode (244/0/0), so expand to a 6-shard
matrix mirroring the existing standard-mode realm-server-test job. Kept
non-blocking (continue-on-error on the test step) so the per-shard
public-read parity gate remains the required signal while any
integration-level fallout is matured. Promote to required once stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wait for /base/_readiness-check (the same gate test-services:host uses
to start its second stage, which covers full indexing including
prerender) before declaring readiness. Then pause 60s to let the
prerender's standby pool fully populate before launching the test
runner's own chrome instance, since concurrent chrome lifecycle events
trigger NetworkChangeNotifier and abort still-loading standby pages
with ERR_NETWORK_CHANGED. Visible in the previous run as 167
ERR_NETWORK_CHANGED events and FilterRefersToNonexistentType errors
from cards being indexed against a base whose modules table never
fully populated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A botched edit left the prior parity step's bash tail merged into the
Settle step's run value, so the executed command became
'sleep 60 cat /tmp/skills_info.json ...' and sleep exited with code 2.
Replace with a clean run: sleep 60 and drop the now-redundant CORS
diagnostic step (CORS through Traefik is already confirmed working).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
card-catalog ran 10/10 cleanly under the env-mode stack with the
strengthened readiness gate + standby settle, so expand the host PoC
into the full 20-shard partitioned suite mirroring the standard
host-test job (HOST_TEST_PARTITION/COUNT consumed by
ember-test-pre-built). Kept non-blocking (continue-on-error on the test
step) so the per-shard public-read parity gate remains the required
signal while any env-mode-specific fallout across the full suite is
matured; promote to required once stable across all shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Path B: host test fixtures and helpers hardcode the live test realm
URL as https://localhost:4202/test/, which doesn't exist in environment
mode (the test realm-server runs at a per-environment Traefik hostname
instead). Expose the running URL as config.resolvedTestRealmURL (derived
from REALM_TEST_URL, which env-vars.sh sets in both modes) and have the
host's NetworkService rewrite the hardcoded URL to it at fetch time
when the two differ. Mirrors the existing addURLMapping pattern used
for the canonical base realm; no-op in standard mode and production.
Closes ~80 env-mode test failures concentrated in shards 4, 10, 11, 12
(realm-querying, realm-indexing, realm tests).
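
The rewrite described for Path B can be sketched roughly as follows. This is an illustration only: `rewriteTestRealmURL` is a hypothetical standalone function, whereas the commit describes the real logic as living in the host's NetworkService and following the addURLMapping pattern.

```typescript
// The URL that test fixtures and helpers hardcode (from the commit message).
const HARDCODED_TEST_REALM_URL = 'https://localhost:4202/test/';

// Returns the URL that should actually be fetched.
// When the resolved URL equals the hardcoded one (standard mode, production),
// the function is a no-op.
function rewriteTestRealmURL(
  url: string,
  resolvedTestRealmURL: string,
): string {
  if (resolvedTestRealmURL === HARDCODED_TEST_REALM_URL) {
    return url;
  }
  if (!url.startsWith(HARDCODED_TEST_REALM_URL)) {
    return url; // not a test-realm URL, leave untouched
  }
  // Swap the hardcoded prefix for the running realm's URL, keep the rest.
  return resolvedTestRealmURL + url.slice(HARDCODED_TEST_REALM_URL.length);
}
```

For example, with a resolved URL of `https://realm-test.ci.localhost/test/`, a fetch for `https://localhost:4202/test/hassan` would go to `https://realm-test.ci.localhost/test/hassan`.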

Cluster C: mirror the standard host-test job's chunk-fetch retry block
in the env-mode host job, so that a transient ChunkLoadError / Failed to
fetch dynamically imported module / NetworkError that aborts a whole
shard before any tests run gets one retry. This previously cost shard 13
its entire run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The appendFileSync call in env-mode-lock.js (added in #5021, merged
into this branch via 4d9c587) was multilined past prettier's print
width, breaking the Lint job's prettier check. Format it on a single
line so CI lint goes green on this branch; the fix is in main territory
and can be cherry-picked there independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resolvedTestRealmURL was added to environment.js in the previous commit
but missing from environment.ts's exported config type, so network.ts
saw it as unknown and ember-tsc failed. Declare it as string.

local-testing.md picked up a trailing-whitespace prettier warning
in the merge from main; reformat to keep the Lint job green here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backspace and others added 13 commits June 1, 2026 09:05
Last run showed Path B closing ~66 env-mode failures (shards 4/10/11/12)
but introducing 24 new failures concentrated in shard 9
(Integration | operator-mode | links: 'waitFor timed out waiting for
[data-test-stack-card=http://test-realm/test/BlogPost/1]'). The same
suite passes 103/0 in standard mode on the same SHA, so the regression
is env-mode-specific and either Path B or a commit pulled in via the
merge from main caused it.

Comment out the addURLMapping call (keep the config + type plumbing) so
the next CI run isolates which: if shard 9 returns to 0 and shards
10/11/12 regress, Path B was the source; if shard 9 stays at 24, the
merge introduced it independently.

Not a final state — this commit is meant to be reverted/replaced once
the source is identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Path B was first added (commit 746e9bd) it caused a 24-test
regression in shard 9's Integration | operator-mode | links: half the
host-app code paths went through VirtualNetwork (with the new mapping)
while the other half still went through the deprecated global
prefixMappings (which didn't), producing asymmetric URL resolution that
broke card-instance lookups.

CS-10752 has since landed in full — the global prefixMappings table and
the deprecated card-reference-resolver module are gone from main, so
every resolution site now goes through VirtualNetwork uniformly. The
asymmetry that broke shard 9 no longer has anywhere to live, so the
mapping is safe to restore. With it back, ~66 hardcoded-URL test
failures across shards 4/10/11/12/15 should close again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env-mode parity gate was waiting on base/_readiness-check, which
returns 200 once the base realm finishes its from-scratch index but
leaves skills indexing still in flight (the realm-server indexes
base and skills sequentially). The AI-assistant tests then fetch
Skill/boxel-environment from the skills realm and get 404 because
skills hasn't been indexed yet, accounting for the 11+7=18 ai-assistant
failures clustered in shards 7 and 17.

Add a parallel wait on skills/_readiness-check so the test step does
not start until both base and skills have completed indexing. Standard
mode does not need this because the host-test job's wait orchestration
naturally delays past skills indexing via its longer asset-restore
path; env-mode CI builds the dist inline and gets to the gate earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Env-mode CI runs `pnpm build` (in packages/host) before launching
test-services, which materializes the pnpm workspace and creates an
empty packages/skills-realm/contents directory. services/realm-server
later invokes `pnpm skills:setup`, whose `[ -d contents ] || git clone`
heuristic sees the directory already exists and skips the clone — so
the skills realm boots with no content. The realm's `#startup` finds
no files to index, completes immediately, and `_readiness-check`
returns 200 promptly. Tests then fetch `@cardstack/skills/Skill/
boxel-environment` and 404, breaking every ai-assistant-panel | skills
and | commands test in env mode (~18 failures total).

Standard mode doesn't hit this because it downloads a pre-built dist
artifact rather than running `pnpm build`, so contents/ doesn't exist
when services/realm-server's skills:setup check runs and the clone
fires normally.

Add an explicit skills:setup step in both env-mode jobs (host and
realm-server) before any pnpm workspace op runs, so contents/ has real
boxel-skills content on disk when test-services starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skills:setup script's SSH-then-HTTPS fallback chain has a silent
failure mode in the env-mode CI: SSH clone fails with 'Permission
denied (publickey)' as expected (no SSH key in CI) but leaves a
non-empty contents/ directory, which the HTTPS clone fallback then
refuses to overwrite. The script exits 0 because git clone's failure
output is swallowed in the chain, so the step succeeds but
packages/skills-realm/contents has no real content. The skills realm
boots empty, _readiness-check returns 200 fast (no work), and tests
404 on Skill/boxel-environment.

Replace the pnpm skills:setup call in both env-mode CI jobs with a
direct, verifiable HTTPS clone: bypass SSH, skip if an actual Skill/
directory is already present (idempotent), and ls the result so any
future regression is visible in the log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arts

Two prior fix attempts populated packages/skills-realm/contents and ls confirmed Skill/boxel-environment.json was present, but the skills realm still indexed zero files in env mode. Either something between the populate step and the realm-server's NodeAdapter read is clobbering the content, or the realm-server is reading from a different path. Add an explicit ls at three vantage points (workspace-relative, Skill subdir, realm-server-cwd-relative) right before test-services start to nail down which case applies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scoped-fromUrl bootstrap realms (skills, openrouter) take their
realm.url from the env-mode served URL, not the canonical, so
the static-URL migration rows in realm_user_permissions never
match. getRealmOwnerUserId then throws "Cannot determine realm
owner" inside from-scratch-index on boot, which fullIndex catches
and swallows -- the realm mounts but indexes zero files. Every
ai-assistant-panel | skills test then 404s.

Extend the env-mode parity step to mirror realm-owner, write, and
named-user permissions by URL pathname (was: only *: read). Keys
off existing standard-mode rows so it stays in lockstep with the
migrations; per-(username, env-url) check preserves any custom
admin permission across reruns.
The realm-server tests run against a template DB whose migrations
already seed standard-mode permission rows for /skills/, /catalog/,
etc. Tests that deepEqual the full row set at the env-mode URL pick
those up alongside their own inserts and yield false "actual"
results: my coalesces + custom-row tests saw three rows including
* and @skills_writer instead of the one row they set up.

Switch the three deepEqual tests to a synthetic /probe-realm/
pathname that no migration touches. The publicReadGranted tests
stay on /skills/ since they assert one property and aren't
affected by extra rows.
The env-mode host-test job invokes Percy inline as
`percy exec --parallel -- pnpm ember-test-pre-built`, bypassing the
`test:wait-for-servers` wrapper this script used to fan into.
Nothing else in the repo calls it.
The URL-equality guard already short-circuited in prod (resolvedTestRealmURL
defaults to the same hardcoded localhost:4202/test/ value the mapping
points away from), so this is a documentation + defense-in-depth move.
Matches the existing isTesting() pattern at monaco.ts / import.ts /
auth-service-worker-registration.ts.
@backspace backspace marked this pull request as ready for review June 12, 2026 19:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bd0aff5dc

Comment thread packages/host/config/environment.js Outdated
backspace added 17 commits June 12, 2026 15:06
The env-mode host bundle is built by test-web-assets.yaml with
BOXEL_ENVIRONMENT=ci set but no REALM_TEST_URL — environment.js
was falling back to the hardcoded standard-mode default
`https://localhost:4202/test/`, so the bundle baked the wrong
value for testModuleRealm and the NetworkService URL rewrite saw
hardcoded==resolved and short-circuited. Card-data references to
testModuleRealm that escape the in-process realm-server mock would
hit a localhost:4202 listener that doesn't exist in env-mode CI.

Add testRealmURL to environmentDefaults() — standard mode keeps
the original `https://localhost:4202/test/`, env mode derives
`https://realm-test.<slug>.localhost/test/` from the slug the
same way the other URLs already do. resolvedTestRealmURL falls
back to defaults.testRealmURL; explicit REALM_TEST_URL still wins
for callers that want a custom endpoint.
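
As a rough illustration of the default derivation described above (the function signature here is hypothetical; only the two URL shapes come from the commit message):

```typescript
// Hypothetical simplification: the real environmentDefaults() returns more
// than one value and reads the slug from BOXEL_ENVIRONMENT.
function environmentDefaults(slug?: string): { testRealmURL: string } {
  return {
    testRealmURL: slug
      ? `https://realm-test.${slug}.localhost/test/` // env mode
      : 'https://localhost:4202/test/', // standard mode
  };
}
```

With `slug` set to `ci`, the bundle would bake `https://realm-test.ci.localhost/test/`, so the hardcoded and resolved values differ and the URL rewrite no longer short-circuits.
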
When the workflow ran `percy` directly via `dbus-run-session -- $TEST_CMD`,
the percy binary in `node_modules/.bin` wasn't on PATH and the shard
exited with "failed to exec 'percy': No such file or directory". Going
through pnpm puts the local bin dir on PATH.

The script body differs from what was on main (env-mode tests don't
need the `test:wait-for-servers` wrapper: the workflow's parity gate
already waits for the live realm-server / matrix to come up).
…-tests-cs-11275

# Conflicts:
#	packages/host/tests/integration/realm-indexing-test.gts
The parity gate waits on base + skills _readiness-check before
running tests, but `icons.<slug>.localhost` has no readiness probe.
A run where the Traefik route for icons hasn't registered yet (or
where the http-server hasn't bound its port) lets the test step
start anyway, and the first card that imports an icon fails with
`TypeError: Failed to fetch` instead of a meaningful 4xx.

Add a probe for a stable icon file (`folder-pen.js`) to both parity
gates (live-test and host-test). 30 attempts at 2s = 60s headroom,
which is comfortably more than the icons server's observed startup
time but won't add noticeable wall-clock when it's healthy.

This rules out the boot-time race as a cause of the recurring
mid-run `Failed to fetch` against icons.ci.localhost. If failures
persist after this lands, the cause is mid-run service drop, not
readiness.
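
The probe loop amounts to fixed-interval polling. The CI step itself is a workflow script; the TypeScript below is only a sketch of the same shape, with `waitUntilReady`, `probe`, and `sleep` as hypothetical names injected so the logic can be exercised without a network.

```typescript
// Polls `probe` until it reports ready or the attempts run out.
// Sleeps `delayMs` after every failed attempt, so 30 attempts at 2000ms
// give 60s of headroom in the worst case.
async function waitUntilReady(
  probe: () => Promise<boolean>,
  attempts: number,
  delayMs: number,
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (await probe()) {
      return true; // healthy service: returns on the first probe, no delay
    }
    await sleep(delayMs);
  }
  return false;
}
```

When the service is already up, the first probe succeeds and no sleep happens, which is why the gate adds no noticeable wall-clock time in the healthy case.
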
…real card

Two related fixes for env-mode host CI:

1. Run `pnpm register-realm-users` in both ci-host.yaml jobs
   (live-test and host-test), matching what standard-mode CI already
   does. Without this, the realm-server's worker fetches `_mtimes`
   from the dev realm-server unauthenticated, the response is 404,
   and from-scratch indexing finishes with zero files. The base
   realm then 404s every card the host bundle loads
   (welcome-to-boxel.json, ai-app-generator.json,
   join-the-community.json, cards/skill, Skill/catalog-listing,
   …), which cascades into the AI Assistant, create-file, and
   highlight-cards tests failing on missing UI elements.

2. card-delete's "can delete a card that is a selected item" set
   stack 1 to the bare realm URL `${testModuleRealm}`. The live
   realm-server doesn't resolve a bare realm URL as a card, so it
   404s. Point stack 1 at `${testModuleRealm}index` instead — the
   index card exists in test-realm-cards/contents/, so the load
   succeeds in both standard and env mode without changing the
   test's intent (two distinct stacks for the selection assertion).
Env-mode CI sometimes sees a momentary `Failed to fetch` against
`icons.<slug>.localhost` mid-run — a brief Traefik route loss or
service-side hiccup. The realm-server hostnames see the same blip,
but the realm fetch path has its own `withRetries` and recovers
silently; the icons path doesn't, so the failure surfaces in
whichever test imports an icon at the wrong instant.

Treat this as the same retry-category as the existing chunk-fetch
transient: broaden the shard retry regex to match
`unable to fetch https://icons\.<host>: fetch failed`, which is
specific to the env-mode wire URL and won't accidentally retry on
real test failures.
Two integration tests
(`realm: realm can serve GET card requests with linksTo relationships to
other realms` and `realm can serve search requests whose results have
linksTo fields`) fetch `${testModuleRealm}hassan` from the live test
realm. In env-mode CI, the test realm-server occasionally finishes its
from-scratch index with zero files when the dev realm-server is
heavy-indexing on the same runner (shared matrix + prerender pool),
producing a 404 from a realm that should serve hassan.

Inverse-correlated across shards — shard A: dev indexes 0 (matrix
race), test indexes 75 (passes); shard B: dev indexes 200, test
indexes 0 (fails). Re-running the shard normally lands after the
race resolves.

Broaden the existing shard-retry regex (already covers chunk-fetch
and icons transients) to also match
`cross-realm fetch failed for https://realm-test.<host>` so this
class of env-mode boot race gets one automatic retry. The regex is
specific to the env-mode wire URL and the cross-realm-fetch log
prefix, so it won't mask real linksTo bugs in tests using
`http://test-realm/...` in-process URLs.
Same env-mode boot race as the prior `cross-realm fetch failed`
pattern: when the test realm-server's from-scratch index finishes
with zero files, any test that reads from `${testModuleRealm}` 404s.
A `linksTo`-driven fetch goes through the loader's cross-realm path
and produces `cross-realm fetch failed for https://realm-test.…`;
a direct store.get() against a card URL produces a raw
`Could not find https://realm-test.…` from the realm-server's
notFound response. Add the second pattern alongside the first so
both forms get one automatic shard retry.
…path

The realm-server worker fetches `<realm>/_mtimes` on boot to discover
which files to index. In env mode this URL is
`https://realm-test.<slug>.localhost/test/_mtimes` and is reached via a
local Traefik. If the worker's first attempt lands before Traefik has
picked up the realm-server's dynamic route file, the connection is
reset (ECONNRESET) and the from-scratch-index job is rejected. Because
`Realm#startup` fires-and-forgets the fullIndex for `isNewIndex=false`
bootstrap realms and the rejected job leaves `realm_versions` at
`current_version=0`, subsequent realm-server reboots don't re-attempt:
the realm stays mounted but unindexed and every later card fetch 404s.

`shouldRetryFetch` already covered bare `localhost` and `127.0.0.1`,
plus the base-realm canonical, plus the production icons CDN, but it
short-circuited on `__environment !== 'test'` before reaching any of
those branches. Worker processes (`worker.ts`) don't set that global —
only `main.ts` does, when NODE_ENV=test — so in a worker the gate
always rejected and no retry fired regardless of which host the URL
named.

Move the `*.localhost` suffix check ahead of the `__environment` gate.
`*.localhost` is reserved for local development and tests by RFC 6761,
so retrying it can never affect production traffic, which lets the
check live above the gate without needing each consumer to opt in to
test mode. Repro: `worker-pid-9274` re-ran the same realm with the new
code and indexed 18 instances / 43 files; `can delete a card that is a
selected item` then passes 5/5 in the browser.
…T set

The prior commit's retry branch fired for any `*.localhost` URL,
which expanded the existing retry surface into the standard-mode
realm-server tests too. Those tests POST to
`testuser.localhost:4445` from supertest into a publish handler whose
internal fetches go through VirtualNetwork; a `withRetries` chain on
each transient internal fetch (10 attempts, ~5.5s total backoff)
stacks past the test's 60s timeout and surfaces as "socket hang up"
on the supertest side, leaving `publishResponse.body` undefined and
yielding the 403-instead-of-202 cascade across tests 157/161/164/167
of `publish-unpublish-realm-test.ts`.

Tighten the gate: the env-mode boot race only matters in processes
spawned under `BOXEL_ENVIRONMENT=<slug>` (env-mode workers,
prerenderer, realm-server). The standard-mode realm-server-test job
doesn't set the variable, so the retry stays off there and the
publish/unpublish tests keep their fail-fast semantics.
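
A simplified sketch of the tightened gate, covering only the new `*.localhost` branch. `shouldRetryEnvModeLocalhost` is a hypothetical name, and the environment value is passed in as a parameter rather than read from the process, which is an assumption made to keep the example self-contained.

```typescript
// Retry `*.localhost` hosts only in processes spawned under
// BOXEL_ENVIRONMENT=<slug>. Standard-mode processes keep fail-fast behavior.
function shouldRetryEnvModeLocalhost(
  url: string,
  boxelEnvironment: string | undefined,
): boolean {
  if (!boxelEnvironment) {
    return false; // variable not set: standard mode, no extra retries
  }
  const { hostname } = new URL(url); // hostname excludes the port
  return hostname.endsWith('.localhost');
}
```

Under this shape, a standard-mode request to `testuser.localhost:4445` is never retried, while an env-mode worker fetching `https://realm-test.ci.localhost/test/_mtimes` is.
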
…y gate

`start-server-and-test` releases as soon as
`realm-test.<slug>.localhost/node-test/_readiness-check` returns 200,
but node-test and `/test/` are mounted sequentially by the same
realm-server process — so the gate can clear while `/test/` is still
running its from-scratch index. Fast host tests that load cards from
`${testModuleRealm}` (= the live `/test/` realm) then race the
indexer and see a 404, surfacing in Percy as "Card Error: Not Found"
even though no QUnit assertion fails.

Add a 60×5s wait on `realm-test.<slug>.localhost/test/_readiness-check`
so the test step doesn't launch until the live test realm is actually
indexed. Sits after the base/skills/icons checks; only the host-test
job needs it (live-test doesn't touch `testModuleRealm`).
Two issues from the latest run:

1. The icons shard-retry regex used `https://icons\.[^/]+: fetch failed`,
   which never matches because the actual log line is the full URL —
   `https://icons.ci.localhost/@cardstack/.../bell.js: fetch failed for
   ...`. With `/` excluded from the hostname character class, the
   match fell off after the first `/`. Switch to `[^:]+` so the
   path component is allowed in the URL before the literal
   `: fetch failed` tail.

2. Both bootstrap realms on the test realm-server's process (`/test/`
   from the mktemp dir and `/node-test/` from the checked-in
   `./tests/fixtures/realistic`) finish their from-scratch index with
   `files_completed=0`. No `mtimes request failed` log appears, so
   the worker is getting a 200 with an empty `mtimes` payload — the
   realm-server is walking what looks like an empty directory. The
   matching dirs are populated in a local repro on the same commit,
   so something in the CI shard environment is producing the
   discrepancy. Log file counts of the temp dir AND the source
   fixtures dir immediately after the cp so the next CI log makes
   it obvious whether the cp ran, whether the source checkout is
   missing, or whether something is clearing them between cp and
   realm-server boot.
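
The regex change in item 1 can be checked in isolation. The sample line below is the log fragment quoted in the commit message, elisions included; the variable names are illustrative.

```typescript
// Log fragment as quoted in the commit message (elisions are from the source).
const sampleLogLine =
  'https://icons.ci.localhost/@cardstack/.../bell.js: fetch failed for ...';

// Old pattern: `[^/]+` cannot cross the first `/` of the path, so the
// literal `: fetch failed` is never reached.
const oldIconsPattern = /https:\/\/icons\.[^/]+: fetch failed/;

// New pattern: `[^:]+` allows the path and stops at the colon that
// precedes ` fetch failed`.
const newIconsPattern = /https:\/\/icons\.[^:]+: fetch failed/;

const oldMatches = oldIconsPattern.test(sampleLogLine);
const newMatches = newIconsPattern.test(sampleLogLine);
```

Here `oldMatches` is `false` and `newMatches` is `true`, which is the behavior difference the commit describes.
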
The realm-server writes its Traefik dynamic route file in
`registerService`, then immediately the `listening` callback returns
and the reconciler kicks off the from-scratch index. Traefik picks
the file up via inotify, but the window between the write and the
route going live is wide enough that the worker's first
`<realm>/_mtimes` request can land before the route exists. Traefik
serves its default `404 page not found` body, the existing handler
logs `mtimes request failed` and returns `{}`, and the indexer
finishes with `files_completed=0` — the realm stays mounted but
unindexed for the rest of the process's life and every later card
fetch 404s. CI host-test shards see this as the
`Could not find https://cardstack.com/base/...` cascade on AI
Assistant / create-file / card-delete tests, plus the same shape
on `realm-test.<slug>.localhost/test/...` for the test
realm-server.

`shouldRetryFetch`'s `*.localhost` branch (added earlier this PR)
only fires on thrown errors; a 404 response is a successful
fetch, so withRetries doesn't kick in. Handle it where the
shape is recognizable: every realm-server response carries the
`X-Boxel-Realm-Url` header, so its absence on a non-OK response
means the response came from Traefik or another intermediary,
not the realm-server itself — i.e. the route isn't live yet.
Retry with linear backoff (10 attempts × 200ms..2s = ~11s worst
case) while the header is missing. Once it appears, fall through
to the original handler — a real realm-server 404 still logs
and returns `{}` as before.
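
A minimal sketch of the header-based retry, assuming an injected fetch and sleep so it runs without a network. `ResponseLike`, `cameFromIntermediary`, and `fetchWithRouteRetry` are hypothetical names, and reading "10 attempts" as ten retries after the first request is an assumption.

```typescript
// Minimal response shape needed for the check (subset of a fetch Response).
interface ResponseLike {
  ok: boolean;
  headers: { has(name: string): boolean };
}

const MAX_RETRIES = 10;
const BACKOFF_STEP_MS = 200;

// Linear backoff: 200ms, 400ms, ... 2000ms (sums to 11000ms over 10 retries).
function backoffMs(retry: number): number {
  return retry * BACKOFF_STEP_MS;
}

// A non-OK response without the realm header did not come from the
// realm-server, so the route is assumed not to be live yet.
function cameFromIntermediary(response: ResponseLike): boolean {
  return !response.ok && !response.headers.has('X-Boxel-Realm-Url');
}

async function fetchWithRouteRetry(
  doFetch: () => Promise<ResponseLike>,
  sleep: (ms: number) => Promise<void>,
): Promise<ResponseLike> {
  let response = await doFetch();
  for (
    let retry = 1;
    retry <= MAX_RETRIES && cameFromIntermediary(response);
    retry++
  ) {
    await sleep(backoffMs(retry));
    response = await doFetch();
  }
  // Whatever comes back last falls through to the caller's normal handling,
  // including a genuine realm-server 404.
  return response;
}
```

A real realm-server 404 carries the header, so it is returned on the first pass without any delay.
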
@backspace backspace requested a review from a team June 17, 2026 02:14
@habdelra habdelra requested a review from Copilot June 17, 2026 13:06

Copilot AI left a comment

Pull request overview

This PR updates the host CI and test harness to run host tests against the environment-mode (*.{slug}.localhost) service stack, aiming to reduce CI flakiness and catch regressions earlier by making URLs, readiness checks, and permissions parity work in both standard and env modes.

Changes:

  • Add env-mode–aware retry/backoff behavior for realm worker fetches during boot, and broaden virtual-network retry gating for env-mode .localhost hosts.
  • Make host tests and helpers derive “test realm”, icons, matrix, and published-realm URLs dynamically from resolved environment config instead of hardcoding ports.
  • Update CI workflows and test asset build pipeline to bake BOXEL_ENVIRONMENT into the host dist, start Traefik, and add stronger readiness / parity checks.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/runtime-common/worker.ts | Adds backoff retries for _mtimes during env-mode Traefik route boot race |
| packages/runtime-common/virtual-network.ts | Enables fetch retries for env-mode .localhost hosts when BOXEL_ENVIRONMENT is set |
| packages/realm-server/tests/realm-registry-backfill-test.ts | Adds tests asserting env-mode permission parity seeding behavior |
| packages/realm-server/lib/realm-registry-backfill.ts | Seeds env-mode permission parity from standard-mode permission rows (path-matched) |
| packages/matrix/support/docker.ts | Minor precedence cleanup in docker pull error message handling |
| packages/matrix/helpers/index.ts | Formatting-only change to a Playwright locator call |
| packages/host/tests/integration/realm-test.gts | Replaces hardcoded test realm module URLs with testModuleRealm |
| packages/host/tests/integration/realm-indexing-test.gts | Makes test realm + icons URL assertions env-mode compatible |
| packages/host/tests/integration/enum-field-test.gts | Uses testModuleRealm when creating cards from serialized docs |
| packages/host/tests/integration/components/serialization-test.gts | Uses testModuleRealm for refs/adoptsFrom assertions |
| packages/host/tests/integration/components/operator-mode-card-chooser-test.gts | Uses testModuleRealm for search input in assertions |
| packages/host/tests/integration/components/card-delete-test.gts | Uses testModuleRealm in operator-mode state setup |
| packages/host/tests/helpers/index.gts | Defines testModuleRealm from resolved host config; uses resolved matrix URL |
| packages/host/tests/acceptance/interact-submode-test.gts | Uses testModuleRealm for active realms and IDs |
| packages/host/tests/acceptance/host-submode-test.gts | Derives published realm host from ENV rather than hardcoded localhost:4201 |
| packages/host/tests/acceptance/commands-test.gts | Uses testModuleRealm in adoptsFrom module URLs |
| packages/host/tests/acceptance/code-submode/recent-files-test.ts | Uses testModuleRealm in recent-files fixture data |
| packages/host/tests/acceptance/code-submode/file-tree-test.ts | Uses testModuleRealm for cross-realm navigation URLs |
| packages/host/tests/acceptance/code-submode-test.ts | Uses testModuleRealm for active realms + binary file navigation |
| packages/host/scripts/live-test-wait-for-servers.sh | Makes readiness URLs scheme/host dynamic for env-mode |
| packages/host/package.json | Adjusts test-with-percy to run ember-test-pre-built directly |
| packages/host/config/environment.js | Adds resolved test realm URL config for env-mode / CI use |
| packages/host/app/services/network.ts | Adds a test-only URL mapping from hardcoded test realm URL to resolved test realm URL |
| packages/host/app/config/environment.ts | Adds resolvedTestRealmURL to the typed environment config |
| mise-tasks/services/test-realms | Adds diagnostics logging for fixture copy file counts in env-mode CI |
| .github/workflows/test-web-assets.yaml | Adds boxel_environment input, bakes it into cache/artifacts and build env |
| .github/workflows/ci-host.yaml | Runs host workflows in env-mode, starts Traefik, waits for readiness/parity, updates retry patterns |

Comment thread packages/runtime-common/worker.ts
Comment thread packages/host/config/environment.js Outdated
In Node/undici, an un-consumed Response keeps the underlying
connection reserved until GC. A 10-attempt retry loop on each indexed
realm at boot would leave up to 10 dangling response bodies per
realm-server, pinning sockets across the backoff window for no
benefit — the body is Traefik's "404 page not found" which we never
use. Cancel the body before sleeping so the connection returns to
the pool immediately.
… URL

The previous form always appended `/test/`, so a value like
`https://my-host/test/` produced `https://my-host/test/test/`. Detect
the case where the override already names the `/test` realm and just
normalize the trailing slash; otherwise keep the existing
append-`/test/` behavior for the base-host shape that the in-repo
`env-vars.sh` uses.
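
One way to express the normalization described above, as a sketch only. `testRealmURLFromOverride` is a hypothetical name, and checking the pathname (rather than the raw string) is a design choice of this example, made so that a host that happens to be named `test` is not mistaken for the `/test` realm path.

```typescript
// Accepts either a base-host override (e.g. 'https://my-host') or one that
// already names the /test realm, and returns a URL ending in '/test/'.
function testRealmURLFromOverride(override: string): string {
  const url = new URL(override);
  const path = url.pathname.replace(/\/+$/, ''); // drop trailing slashes
  url.pathname = path.endsWith('/test')
    ? `${path}/` // already the test realm: only normalize the slash
    : `${path}/test/`; // base-host shape: append the realm path
  return url.href;
}
```

Both `https://my-host/test/` and `https://my-host` resolve to `https://my-host/test/` under this sketch, avoiding the doubled `/test/test/` path.
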
@backspace backspace merged commit 2c77de4 into main Jun 17, 2026
73 of 74 checks passed